ABSTRACT

This paper discusses the theory and application of learning Boolean functions that are concentrated in the Fourier 
domain. We first estimate the VC dimension of this function class in order to establish a small sample complexity 
of learning in this case. Next, we propose a computationally efficient method of empirical risk minimization, and 
we apply this method to the MNIST database of handwritten digits. These results demonstrate the effectiveness 
of our model for modern classification tasks. We conclude with a list of open problems for future investigation.
1. INTRODUCTION


The last year has produced several breakthroughs in classification. Deep neural networks now match human-level
performance in both facial recognition [19] and general image recognition [10], and they also outperform both Apple
and Google's proprietary speech recognizers [8]. These advances leave one wondering about their implications for
the general field of machine learning. Indeed, Google and Facebook have actively acquired top talent in deep
neural networks to pursue these leads [6, 11]. As a parallel pursuit, many have sought theoretical justification for
the unreasonable effectiveness of deep neural networks [3, 4, 5, 17]. Unfortunately, the theory is still underdeveloped,
as we currently lack a theoretical grasp of the computational complexity of learning with modern deep neural 
networks. Answers to such fundamental questions will help illustrate the scope of these emerging capabilities. 


Observe that neural networks resemble circuit implementations of Boolean functions. Indeed, a circuit 
amounts to a directed acyclic graph with n input nodes and a single output node, along with intermediate 
nodes that represent Boolean logic gates (such as ANDs, ORs, threshold gates, etc.). As such, one may view 
circuits as discrete analogues of neural networks. To date, there is quite a bit of theory behind the learnability of
Boolean functions with sufficiently simple circuit implementations [12, 16, 18]. The main idea is that such functions
enjoy a highly concentrated Fourier transform due to a clever application of Håstad's Switching Lemma [9]. Passing
through the analogy, one might then hypothesize that the real-world functions that are well approximated by 
learnable deep neural networks also enjoy a highly concentrated Fourier transform—this hypothesis motivates 
our approach. 


This paper discusses the theory and application of learning Boolean functions that are concentrated in the 
Fourier domain. The following section provides some background material on statistical learning theory to help 
set the stage for our investigation. We then prove in Section 3 that the sample complexity of learning Boolean 
functions of concentrated spectra is small. Section 4 proposes a learning algorithm as a first step towards tackling 
the computational complexity, and then Section 5 illustrates how well our model performs on a real-world dataset 
(namely, the MNIST database of handwritten digits [14]). We conclude in Section 6 with a list of open problems.


2. BACKGROUND 

The objective is to estimate an unknown labeling function $f\colon\{\pm1\}^n\to\{\pm1\}$. A sample $\{x_i\}_{i=1}^\ell\subseteq\{\pm1\}^n$ is
drawn i.i.d. according to some unknown distribution $p$, and we receive the labeled training set $\{(x_i,f(x_i))\}_{i=1}^\ell$.

The quality of our estimate $\hat{f}\colon\{\pm1\}^n\to\{\pm1\}$ will be evaluated in terms of the risk functional

$$R(\hat{f},f) := \sum_{x\in\{\pm1\}^n} 1_{\{\hat{f}(x)\neq f(x)\}}\,p(x).$$

Risk is commonly approximated by empirical risk using a random sample $\{y_i\}_{i=1}^\ell$ drawn i.i.d. according to $p$:

$$R_y(\hat{f},f) := \frac{1}{\ell}\sum_{i=1}^{\ell} 1_{\{\hat{f}(y_i)\neq f(y_i)\}}.$$

In practice, empirical risk is used to evaluate an estimate $\hat{f}$ with the help of a labeled test set (which is disjoint
from the training set). Similarly, it makes sense to pick $\hat{f}$ in such a way that minimizes empirical risk over
the training set. However, the training sample only covers a small fraction of the sample space $\{\pm1\}^n$, so how
do we decide which values $\hat{f}$ should take beyond this sample?
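
As a small illustration of empirical risk, the following sketch computes the misclassification rate of an estimate on a labeled sample. It assumes the risk is the probability of misclassification (consistent with the misclassification rates reported in Section 5); the names `empirical_risk`, `truth`, `estimate` and `test_set` are hypothetical and chosen only for this example.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[int, ...]  # a vector in {-1, +1}^n


def empirical_risk(f_hat: Callable[[Point], int],
                   f: Callable[[Point], int],
                   sample: Sequence[Point]) -> float:
    """Fraction of sample points on which the estimate disagrees with the truth."""
    mistakes = sum(1 for y in sample if f_hat(y) != f(y))
    return mistakes / len(sample)


# Hypothetical toy example: the truth is the product of the first two bits,
# while the estimate only looks at the first bit.
truth = lambda x: x[0] * x[1]
estimate = lambda x: x[0]
test_set = [(1, 1, 1), (1, -1, 1), (-1, 1, -1), (-1, -1, 1)]
risk = empirical_risk(estimate, truth, test_set)  # disagrees on 2 of the 4 points
```

The same function serves for both training and test error; only the sample passed in changes.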

The trick is to restrict our empirical risk minimization to a "simple" function class $\mathcal{C}\subseteq\{g\colon\{\pm1\}^n\to\{\pm1\}\}$.
Intuitively, if we suspect that $f$ is close to some member of a small set $\mathcal{C}$, and a large training set happens to
nearly match one such member $\hat{f}$, then chances are small that this occurred by mere coincidence, and so we should
expect $R(\hat{f},f)$ to be small. In general, $\mathcal{C}$ doesn't need to be small, as it suffices for $\mathcal{C}$ to enjoy a broader notion
of simplicity:

Definition 1. The function class $\mathcal{C}\subseteq\{g\colon\{\pm1\}^n\to\{\pm1\}\}$ is said to shatter $\{x_i\}_{i=1}^\ell\subseteq\{\pm1\}^n$ if for every
choice of labels $y_1,\ldots,y_\ell\in\{\pm1\}$, there exists a function $g\in\mathcal{C}$ such that $g(x_i)=y_i$ for every $i$. The VC dimension of $\mathcal{C}$
is the size of the largest set that $\mathcal{C}$ shatters.
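
To make the notion of shattering concrete, here is a brute-force check for a finite function class, offered as a sketch rather than an efficient procedure. The function name `shatters` and the toy class are hypothetical, and the points are assumed to be distinct.

```python
from typing import Callable, Iterable, Sequence, Tuple

Point = Tuple[int, ...]  # a vector in {-1, +1}^n


def shatters(function_class: Iterable[Callable[[Point], int]],
             points: Sequence[Point]) -> bool:
    """Return True if every +/-1 labeling of `points` is realized by some member."""
    realized = {tuple(g(x) for x in points) for g in function_class}
    return len(realized) == 2 ** len(points)


# Hypothetical toy class on {-1,+1}^2: both coordinate functions and their negations.
toy_class = [
    lambda x: x[0], lambda x: -x[0],
    lambda x: x[1], lambda x: -x[1],
]
two_points = [(1, 1), (1, -1)]
three_points = [(1, 1), (1, -1), (-1, 1)]
```

With four functions, at most four labelings can be realized, so no set of three points can be shattered; this is the counting argument behind the bound of $\log_2$ of the class size on the VC dimension.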


We consider $\mathcal{C}$ to be simple if its VC dimension is small. By counting (shattering $m$ points requires $2^m$ distinct
members), it is clear that the VC dimension of $\mathcal{C}$ is $\leq\log_2|\mathcal{C}|$, meaning small sets are necessarily simple. The
following result illustrates the utility of VC dimension as a notion of simplicity:

Theorem 2 (obtained by combining equations (3.15) and (3.23) from Vapnik [21]). Fix $\mathcal{C}\subseteq\{g\colon\{\pm1\}^n\to\{\pm1\}\}$
and let $h$ denote its VC dimension. Pick $f\colon\{\pm1\}^n\to\{\pm1\}$ and draw $\{x_i\}_{i=1}^\ell\subseteq\{\pm1\}^n$ i.i.d. according to some
(unknown) distribution $p$. Then with probability $\geq 1-\eta$,

$$R(g,f) \leq R_x(g,f) + \sqrt{\frac{h(\log(2\ell/h)+1)-\log(\eta/4)}{\ell}}$$

for all $g\in\mathcal{C}$ simultaneously. Here, the probability is on $\{x_i\}_{i=1}^\ell$.
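
To get a feel for how the training set size must compare with the VC dimension, the following sketch evaluates the square-root term of the bound. It assumes the commonly quoted form $\sqrt{(h(\log(2\ell/h)+1)-\log(\eta/4))/\ell}$ with natural logarithms; the constants in the exact statement may differ, and the function name `vc_confidence_term` is hypothetical.

```python
import math


def vc_confidence_term(h: int, ell: int, eta: float) -> float:
    """Assumed form: sqrt((h * (log(2*ell/h) + 1) - log(eta/4)) / ell)."""
    if not (0 < h <= ell) or not (0 < eta < 1):
        raise ValueError("expected 0 < h <= ell and 0 < eta < 1")
    numerator = h * (math.log(2 * ell / h) + 1) - math.log(eta / 4)
    return math.sqrt(numerator / ell)


# The term shrinks as the sample grows relative to the VC dimension.
small_sample = vc_confidence_term(h=10, ell=1_000, eta=0.05)
large_sample = vc_confidence_term(h=10, ell=100_000, eta=0.05)
```

Under this assumed form, the term only becomes small once the sample size is much larger than the VC dimension.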


In words, Theorem 2 states that the risk of an estimate is small provided its empirical risk is small over a
sufficiently large training set (namely, $\ell\gg h$). This suggests three properties that we want our function class $\mathcal{C}$
to satisfy:


• Simple. We want the VC dimension of $\mathcal{C}$ to be small, so as to allow for a small sample complexity.

• Admits fast optimization. We want empirical risk minimization over $\mathcal{C}$ to be computationally efficient.

• Models reality. Given an application, we want the true function to be close to some member of $\mathcal{C}$.


In the remainder of this paper, we study whether Boolean functions with concentrated spectra form a function 
class which satisfies these desiderata. To be explicit, the following defines the function class of interest: 

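Definition 3. Let $\mathcal{C}_{n,k}$ denote the class of all functions $g\colon\{\pm1\}^n\to\{\pm1\}$ for which there exist index sets
$S_1,\ldots,S_k\subseteq[n]$ and coefficients $a_1,\ldots,a_k\in\mathbb{R}$ such that

$$g(x) = \operatorname{sign}\bigg(\sum_{i=1}^{k} a_i \prod_{j\in S_i} x_j\bigg) \qquad \forall x\in\{\pm1\}^n. \qquad (1)$$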
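
The definition can be read as a recipe for evaluating a classifier: sum a few weighted monomials and take the sign. The sketch below does exactly that, using 0-based indices and the assumed convention that the sign of zero is $+1$ (the definition does not fix this). The name `concentrated_classifier` is hypothetical.

```python
from itertools import product
from math import prod


def concentrated_classifier(x, index_sets, coeffs):
    """Evaluate sign(sum_i a_i * prod_{j in S_i} x_j) for x in {-1,+1}^n."""
    total = sum(a * prod(x[j] for j in S) for a, S in zip(coeffs, index_sets))
    return 1 if total >= 0 else -1  # assumed convention: sign(0) = +1


# Majority of three bits, written with k = 3 degree-one monomials.
index_sets = [(0,), (1,), (2,)]
coeffs = [1.0, 1.0, 1.0]
agrees_with_majority = all(
    concentrated_classifier(x, index_sets, coeffs) == (1 if sum(x) > 0 else -1)
    for x in product([1, -1], repeat=3)
)
```

An empty index set yields the constant monomial, since an empty product equals 1.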

In the following section, we estimate the VC dimension of $\mathcal{C}_{n,k}$. Next, Section 4 proposes a method for performing
empirical risk minimization over $\mathcal{C}_{n,k}$. Finally, we apply this method to the MNIST database of handwritten
digits [14] in Section 5 to illustrate how well $\mathcal{C}_{n,k}$ models reality (at least in one application).




3. ESTIMATING THE VC DIMENSION 


Considering Theorem 2, we desire a function class of small VC dimension, as this will allow us to get away with
a small training set. In this section, we estimate the VC dimension of $\mathcal{C}_{n,k}$, the class of functions of the form (1).
To this end, it is helpful to identify how (1) is related to the Walsh-Hadamard transform $W\colon\ell^2(\mathbb{Z}_2^n)\to\ell^2(\mathbb{Z}_2^n)$
defined by

$$(Wz)(v) := \sum_{u\in\mathbb{Z}_2^n} z(u)(-1)^{u\cdot v} \qquad \forall v\in\mathbb{Z}_2^n.$$

Taking $S_u := \{j : u_j = 1\}$ and $x_j := (-1)^{v_j}\in\{\pm1\}$, we equivalently have

$$(Wz)(\log_{-1}(x)) = \sum_{u\in\mathbb{Z}_2^n} z(u)\prod_{j\in S_u} x_j \qquad \forall x\in\{\pm1\}^n.$$

Note that every real polynomial over $\{\pm1\}^n$ is of the form $(Wz)\circ\log_{-1}$. When the coefficients are $k$-sparse,
taking the sign of this polynomial produces a member of $\mathcal{C}_{n,k}$.

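Recall the matrix representation of $W$, namely

$$W = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}^{\otimes n}.$$

The vector $Wz$ lists all $2^n$ possible outputs of $(Wz)\circ\log_{-1}$. As such, we may identify $\mathcal{C}_{n,k}$ with

$$\{\operatorname{sign}(Wz) : \|z\|_0\leq k\}. \qquad (2)$$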
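
The identification between the transform and polynomial evaluation can be checked numerically for small $n$. The sketch below assumes the unnormalized transform $(Wz)(v)=\sum_u z(u)(-1)^{u\cdot v}$ and encodes each $u,v\in\mathbb{Z}_2^n$ as the bits of an integer; the function names and the example vector are hypothetical.

```python
from math import prod


def walsh_hadamard(z):
    """Unnormalized fast transform: (Wz)(v) = sum_u z[u] * (-1)^(u . v)."""
    out = list(z)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def polynomial_value(z, v, n):
    """Evaluate sum_u z[u] * prod_{j in S_u} x_j at x_j = (-1)^(bit j of v)."""
    x = [(-1) ** ((v >> j) & 1) for j in range(n)]
    return sum(z[u] * prod(x[j] for j in range(n) if (u >> j) & 1)
               for u in range(len(z)))


n = 3
z = [0, 2, 0, -1, 0, 0, 0, 3]  # a 3-sparse coefficient vector of length 2^n
transform = walsh_hadamard(z)
direct = [polynomial_value(z, v, n) for v in range(2 ** n)]
classifier_values = [1 if t >= 0 else -1 for t in transform]  # assumed sign(0) = +1
```

The lists `transform` and `direct` coincide, and `classifier_values` tabulates the corresponding member of the function class on all $2^n$ inputs.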

We use this identification to prove the main result of this section: 

Theorem 4. The VC dimension of $\mathcal{C}_{n,k}$ is $\leq 2nk - O(k\log k)$.


Proof. Since the VC dimension is $\leq\log_2|\mathcal{C}_{n,k}|$, it suffices to estimate $|\mathcal{C}_{n,k}|$. Considering (2), this quantity is
the number of orthants in $\mathbb{R}^{2^n}$ that intersect the union of subspaces $\{Wz : \|z\|_0\leq k\}$. By Theorem V.1 of van
den Berg and Friedlander [20], each subspace intersects at most $2\sum_{i=0}^{k-1}\binom{2^n-1}{i}$ orthants, and so

$$|\mathcal{C}_{n,k}| \leq \binom{2^n}{k}\cdot 2\sum_{i=0}^{k-1}\binom{2^n-1}{i}.$$

Taking logs of both sides and applying the bound $\binom{a}{b}\leq(ea/b)^b$ then gives the result. □

As a complementary result, the following illustrates how tight our bound is:

Proposition 5. The VC dimension of $\mathcal{C}_{n,k}$ is $\geq\max\{n,k\}$.


Proof. It suffices to find a subcollection $S$ of $\max\{n,k\}$ row indices such that

$$\{\operatorname{sign}(W_Sz) : \|z\|_0\leq k\} = \{\pm1\}^{|S|},$$

where $W_S$ denotes the $|S|\times2^n$ submatrix of rows from $W$ indexed by $S$. In the case where $n\geq k$, let $S$ index
the rows of $W$ which have the form

$$w_i := (1,1)^{\otimes(i-1)}\otimes(1,-1)\otimes(1,1)^{\otimes(n-i)}$$

for some $i\in[n]$. Note that for every $u\in\mathbb{Z}_2^n$, the corresponding identity basis element is given by $e_u := \bigotimes_{i=1}^{n}e_{u_i}$,
where $e_0 := (1,0)$ and $e_1 := (0,1)$. This then implies

$$(W_Se_u)_i = \langle e_u,w_i\rangle = \langle e_{u_i},(1,-1)\rangle = (-1)^{u_i}.$$

As such, every member of $\{\pm1\}^n$ can be expressed as $W_Se_u$ for some identity basis element $e_u$. In the remaining
case where $k>n$, observe that $W$ has rank $2^n\geq k$, and so there exists a $k\times k$ submatrix $A$ of rank $k$. Let $S$
denote the row indices of $A$. Then every vector in $\{\pm1\}^k$ can be expressed as $W_Sz$, where $z$ is supported on the
column indices of $A$. □
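
The first case of the proof can be verified by direct enumeration for small $n$. In the sketch below, integers encode elements of $\mathbb{Z}_2^n$ by their bits, and the row of $W$ whose index has a single nonzero bit plays the role of $w_i$ (an assumption about index ordering that does not affect the conclusion). The function names are hypothetical.

```python
def hadamard_entry(row: int, col: int) -> int:
    """Entry of the unnormalized Walsh-Hadamard matrix: (-1)^(row . col)."""
    return -1 if bin(row & col).count("1") % 2 else 1


def sign_patterns_from_basis_vectors(n: int):
    """Collect W_S e_u over all u, where S indexes the n rows playing the role of w_i."""
    rows = [1 << i for i in range(n)]  # row indices with exactly one nonzero bit
    patterns = set()
    for u in range(2 ** n):
        # W_S e_u is simply column u of W restricted to the rows in S.
        patterns.add(tuple(hadamard_entry(r, u) for r in rows))
    return patterns


patterns = sign_patterns_from_basis_vectors(4)
```

All $2^n$ sign patterns appear, each produced by a 1-sparse vector, so these $n$ rows are shattered.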




In pursuit of better lower bounds, we need more techniques to tackle the notion of shattering. The proof of 
the following proposition provides some ideas along these lines: 

Proposition 6. Given an $N\times N$ matrix $A$, denote $\mathcal{C}_{A,k} := \{\operatorname{sign}(Az) : \|z\|_0\leq k\}$.

(a) $\mathcal{C}_{A,k}$ fails to shatter a set of size $O(k^2)$ if $A$ is the Walsh-Hadamard transform matrix and $k\ll\sqrt{N}$.

(b) $\mathcal{C}_{A,k}$ fails to shatter a set of size $O(k\log(N/k))$ w.h.p. if the entries of $A$ are i.i.d. $\mathcal{N}(0,1)$.

Proof. (a) Define the sign rank of a matrix $S$ with entries in $\{\pm1\}$ to be the minimum rank of all matrices $M$
satisfying $M_{ij}S_{ij}>0$ for every $i,j$. Write $N=2^n$, and observe that each column of $A$ can be reshaped to be a
$2^{\lfloor n/2\rfloor}\times2^{\lceil n/2\rceil}$ matrix of rank 1. Then reshaping $\operatorname{sign}(Az)$ in the same way produces a $2^{\lfloor n/2\rfloor}\times2^{\lceil n/2\rceil}$ matrix
of sign rank $\leq k$. By the left-hand inequality of equation (1) from Alon, Frankl and Rödl [1], for every $m$, there
exists an $m\times m$ matrix $S$ with entries in $\{\pm1\}$ of sign rank $\geq m/32$. Taking $m := 32(k+1)\leq2^{\lfloor n/2\rfloor}$, use the
corresponding matrix $S$ to construct a $2^{\lfloor n/2\rfloor}\times2^{\lceil n/2\rceil}$ matrix $S'$ by padding with $\pm1$s. Then $S'$ has sign rank
$\geq k+1$ (implying $S'\notin\mathcal{C}_{A,k}$) regardless of how the padded entries are selected. As such, $\mathcal{C}_{A,k}$ fails to shatter
these $m^2$ entries.

(b) We seek $m$ such that for every $f\in\mathcal{C}_{A,k}$, the first $m$ entries of $f$ are not all 1s. Let $A_{I,J}$ denote the
submatrix with row indices in $I$ and column indices in $J$. Then equivalently, we seek $m$ such that for every
$K\subseteq[N]$ with $|K|=k$, the subspace $\operatorname{im}(A_{[m],K})$ intersects the nonnegative orthant $\mathbb{R}^m_{\geq0}$ uniquely at the origin.
Since $\operatorname{im}(A_{[m],K})$ is drawn uniformly from the Grassmannian of $k$-dimensional subspaces in $\mathbb{R}^m$, we may apply
Gordon's Escape Through a Mesh Theorem (namely, Corollary 3.4 of Gordon [7]). For this theorem, we use the
fact that the Gaussian width of the positive orthant is $\sqrt{m/2}-O(1/\sqrt{m})$, as established by Propositions 3.2
and 10.2 in Amelunxen et al. [2]. Then

Pr (im(.4[,n = {0}) > 1 - 5 exp ( - 1 - o(^;^ + ;^)) )' 

Taking $m = ck\log(N/k)$ for sufficiently large $c$, the union bound then gives

$$\Pr\big(\operatorname{im}(A_{[m],K})\cap\mathbb{R}^m_{\geq0}=\{0\}\ \ \forall K\subseteq[N],\ |K|=k\big) \geq 1-e^{-\Omega(k\log(N/k))}.$$

As such, $\mathcal{C}_{A,k}$ fails to shatter the first $m$ entries. □
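
The reshaping step in part (a) relies on each column of the Walsh-Hadamard matrix becoming a rank-1 matrix. The following sketch checks this for a small $n$ under a row-major reshaping (an assumed convention; any split of the bits into two groups works the same way). The function names are hypothetical.

```python
def hadamard_entry(row: int, col: int) -> int:
    """Entry of the unnormalized Walsh-Hadamard matrix: (-1)^(row . col)."""
    return -1 if bin(row & col).count("1") % 2 else 1


def column_as_matrix(u: int, n: int):
    """Reshape column u of W (length 2^n) into a 2^floor(n/2) x 2^ceil(n/2) matrix."""
    rows, cols = 2 ** (n // 2), 2 ** (n - n // 2)
    column = [hadamard_entry(v, u) for v in range(2 ** n)]
    return [column[r * cols:(r + 1) * cols] for r in range(rows)]


def is_rank_one(M) -> bool:
    """Rank-1 test for a matrix with nonzero top-left entry: M[i][j]*M[0][0] == M[i][0]*M[0][j]."""
    return all(M[i][j] * M[0][0] == M[i][0] * M[0][j]
               for i in range(len(M)) for j in range(len(M[0])))


all_rank_one = all(is_rank_one(column_as_matrix(u, 5)) for u in range(2 ** 5))
```

Since $Az$ with $k$-sparse $z$ is a combination of $k$ such columns, its reshaped version has rank at most $k$, which is what bounds the sign rank in the proof.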

4. EMPIRICAL RISK MINIMIZATION 

In this section, we consider the problem of empirical risk minimization over $\mathcal{C}_{n,k}$.

Problem 7. Let $W$ denote the $2^n\times2^n$ Walsh-Hadamard transform matrix, and let $z$ be some (nearly) $k$-sparse
vector in $\mathbb{R}^{2^n}$. Given a sample $x$ of $\ell$ entries of $f=\operatorname{sign}(Wz)$, find $\hat{f}\in\mathcal{C}_{n,k}$ satisfying

$$R_x(\hat{f},f) \leq \mathrm{const}\cdot\min_{g\in\mathcal{C}_{n,k}} R_x(g,f) \qquad (3)$$

in $\mathrm{poly}(n,k,\ell)$ time.

This problem can be viewed as a combination of the sparse fast Walsh-Hadamard transform [15] and one-bit
compressed sensing [13]. Computationally, the main difficulty seems to be achieving polynomial time in $n$ rather
than $2^n$, and to do so, one must somehow take advantage of the rich structure provided by the Walsh-Hadamard
transform. As a cheap alternative, this paper instead simplifies the problem by strengthening the assumptions,
namely, that $z$ is mostly supported on members of $\mathbb{Z}_2^n$ with Hamming weight $\leq d$, that is, the polynomial
$(Wz)\circ\log_{-1}$ has degree at most $d$. This allows us to discard the vast majority of columns of $W$, leaving only
$O(n^d)$ columns to consider.
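
A quick count shows how much the degree restriction helps. The sketch below counts the columns of Hamming weight at most $d$ and compares with the full $2^n$; the function name is hypothetical, and the values $n=25$ and $d=3$ are the ones used later in Section 5.

```python
from math import comb


def num_low_degree_columns(n: int, d: int) -> int:
    """Number of index vectors in Z_2^n with Hamming weight at most d."""
    return sum(comb(n, i) for i in range(d + 1))


kept = num_low_degree_columns(25, 3)   # 1 + 25 + 300 + 2300
total = 2 ** 25
```

For these parameters, only 2626 of the 33554432 columns survive.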

Let $W_{x,d}$ denote the submatrix of $W$ with row indices in the sample $x$ and column indices of Hamming weight
$\leq d$. Also, let $f_x$ denote the true function $f$ restricted to the sample $x$. To isolate an estimate $\hat{f}\in\mathcal{C}_{n,k}$, we first
perform feature selection by finding the columns of $W_{x,d}$ which look the most like $f_x$. That is, we find the largest
entries of $|W_{x,d}^\top f_x|$ and isolate the corresponding columns of $W_{x,d}$. Thanks to the discrete nature of $W$, some of
these columns may be identical up to a global sign factor. As such, we collect columns of $W_{x,d}$ corresponding
to the largest entries of $|W_{x,d}^\top f_x|$ until we have $k$ distinct columns up to sign. Let $A$ denote the resulting $\ell\times k$
matrix of columns. Then it remains to find a coefficient vector $z$ such that $f_x\approx\operatorname{sign}(Az)$, that is, to train a
support vector machine. After finding $z$, we pick $\hat{f}\in\mathcal{C}_{n,k}$ according to (1) by taking the coefficients $a_i$ to
be the entries of $z$ and taking the index sets to be $S_i = \{j : u_j = 1\}$, where $u\in\mathbb{Z}_2^n$ is the column index of $W$
corresponding to the $i$th column of $A$.
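
The feature selection step described above can be sketched as follows. This is a plain enumeration over all monomials of degree at most $d$, not an optimized implementation, and the names `low_degree_index_sets` and `select_features` are hypothetical. Ties in correlation are broken by enumeration order, which is an assumption.

```python
from itertools import combinations, product
from math import prod


def low_degree_index_sets(n, d):
    """All index sets S within {0,...,n-1} of size at most d."""
    return [S for size in range(d + 1) for S in combinations(range(n), size)]


def select_features(samples, labels, d, k):
    """Pick up to k monomials most correlated with the labels, distinct up to sign."""
    n = len(samples[0])
    scored = []
    for S in low_degree_index_sets(n, d):
        column = tuple(prod(x[j] for j in S) for x in samples)
        correlation = abs(sum(c * y for c, y in zip(column, labels)))
        scored.append((correlation, S, column))
    scored.sort(key=lambda t: -t[0])  # stable sort: ties keep enumeration order
    chosen, seen = [], set()
    for _, S, column in scored:
        # Normalize the global sign so that a column and its negation coincide.
        canonical = column if column[0] == 1 else tuple(-c for c in column)
        if canonical in seen:
            continue
        seen.add(canonical)
        chosen.append(S)
        if len(chosen) == k:
            break
    return chosen


# Toy check: labels given by a single degree-2 monomial on the full cube.
cube = list(product([1, -1], repeat=4))
labels = [x[0] * x[2] for x in cube]
top_feature = select_features(cube, labels, d=2, k=1)
```

On the full cube, distinct monomials are orthogonal, so the true monomial is the only one with nonzero correlation and is selected first.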

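We note that alternatively, one could train an $\ell_1$-restricted support vector machine [22] to find a sparse $z$ such
that $f_x\approx\operatorname{sign}(W_{x,d}z)$:

$$\min\ \sum_{i=1}^{\ell}\big(1-(f_x)_i(W_{x,d}z)_i\big)_+ \quad \text{s.t.} \quad \|z\|_1\leq\tau. \qquad (4)$$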

However, we found this to be slow in practice, even for small values of $d$. Still, we followed the intent of this
method by applying a loose $\ell_1$ restriction to our support vector machine training, albeit after we performed
feature selection.

We note that for a fixed $d$, our two-step method (feature selection, then support vector machine training)
runs in time which is polynomial in $n$, $k$ and $\ell$. Unfortunately, we currently lack a performance guarantee of the
form (3). Instead, we apply this method to real-world data in the following section to illustrate its effectiveness
(as well as the quality of the function model $\mathcal{C}_{n,k}$).
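
For illustration only, an $\ell_1$-restricted hinge-loss problem of this kind can be attacked with projected subgradient descent. This is a swapped-in solver chosen for brevity, not the one used in the experiments, and it assumes the objective is the summed hinge loss $\sum_i(1-(f_x)_i(Az)_i)_+$ subject to $\|z\|_1\leq\tau$. All function names, the step size and the iteration count are hypothetical.

```python
def hinge_objective(A, f, z):
    """Sum over samples of (1 - f_i * (A z)_i)_+ ."""
    return sum(max(0.0, 1.0 - fi * sum(a * zj for a, zj in zip(row, z)))
               for row, fi in zip(A, f))


def project_onto_l1_ball(v, tau):
    """Euclidean projection of v onto {z : ||z||_1 <= tau} (sort-based method)."""
    if sum(abs(t) for t in v) <= tau:
        return list(v)
    magnitudes = sorted((abs(t) for t in v), reverse=True)
    running, theta = 0.0, 0.0
    for j, m in enumerate(magnitudes, start=1):
        running += m
        candidate = (running - tau) / j
        if m - candidate > 0:
            theta = candidate  # soft-threshold level from the last valid index
    return [(1 if t >= 0 else -1) * max(abs(t) - theta, 0.0) for t in v]


def train_l1_restricted_svm(A, f, tau, step=0.1, iterations=200):
    """Projected subgradient descent on the averaged hinge loss."""
    z = [0.0] * len(A[0])
    for _ in range(iterations):
        grad = [0.0] * len(z)
        for row, fi in zip(A, f):
            margin = fi * sum(a * zj for a, zj in zip(row, z))
            if margin < 1.0:  # only margin violators contribute a subgradient
                for j, a in enumerate(row):
                    grad[j] -= fi * a
        z = [zj - step * g / len(A) for zj, g in zip(z, grad)]
        z = project_onto_l1_ball(z, tau)
    return z


# Toy data: the label equals the first feature; the second feature is uninformative.
A = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
f = [1, 1, -1, -1]
z_loose = train_l1_restricted_svm(A, f, tau=1000.0)
z_tight = train_l1_restricted_svm(A, f, tau=0.5)
```

With a loose restriction the projection never activates and the solver behaves like an unconstrained hinge-loss minimizer, whereas a tight restriction caps the size of the coefficients.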

5. IMPLEMENTATION WITH HANDWRITTEN DIGITS 

The MNIST database of handwritten digits [14] contains 5923 zeros and 6742 ones, and each digit image is
represented by a $28\times28$ matrix with entries in $[0,1]$. To keep runtime reasonable, we reduced each image to a $5\times5$
matrix by convolving with the indicator function of a $5\times5$ block and sampling over a $5\times5$ grid. We then
thresholded the entries to obtain vectors in $\{\pm1\}^{25}$; typical results of this process are illustrated in Figure 1.
A $-1$ label was assigned to the zeros, and ones were similarly labeled with a $1$. At this point, classification amounts
to learning a function $f\colon\{\pm1\}^{25}\to\{\pm1\}$.
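
A minimal sketch of this preprocessing is given below. The text does not specify where the $5\times5$ grid sits inside the $28\times28$ image or which threshold was used, so the centered $25\times25$ window and the threshold of 0.5 on the block mean are assumptions made purely for illustration, as is the function name.

```python
def downsample_and_threshold(image, block=5, grid=5, threshold=0.5):
    """Map a square image with entries in [0,1] to a tuple in {-1,+1}^(grid*grid)."""
    size = len(image)
    offset = (size - block * grid) // 2  # assumed: center the sampled window
    features = []
    for r in range(grid):
        for c in range(grid):
            top, left = offset + r * block, offset + c * block
            # Convolution with a block indicator, sampled on the grid, is a block sum.
            total = sum(image[i][j]
                        for i in range(top, top + block)
                        for j in range(left, left + block))
            mean = total / (block * block)
            features.append(1 if mean >= threshold else -1)  # assumed threshold
    return tuple(features)


blank = [[0.0] * 28 for _ in range(28)]
full = [[1.0] * 28 for _ in range(28)]
```

Each image then becomes a single point of $\{\pm1\}^{25}$, which is the input format expected by the feature selection step.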

After processing the data in this way, we implemented our method of feature selection and support vector
machine training as detailed in Section 4. We assumed the polynomial $(Wz)\circ\log_{-1}$ has degree at most
$d$, and we fixed $d=3$. We chose this value because we found that increasing $d$ greatly increases runtime without
empirically improving the classifier. In order to choose a sparsity level $k$, we performed feature selection and
trained a support vector machine for multiple choices of $k$. Intuitively, taking $k$ too small will overly simplify the
model and fail to match the inherent complexities of the data, while large choices for $k$ will lead to overfitting.
Before picking $k$, we first let this parameter range from 10 to 280 in increments of 10, and for each value of $k$, we
performed 10 experiments in which we chose disjoint training and testing sets of sizes 1500 and 2500 respectively
from each label class. These sets were chosen uniformly at random without replacement. We decided to make
the training set small relative to the entire database due to the long runtime required to train a support vector
machine; we used a much larger training set after we identified the "optimal" $k$.
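
The sampling protocol for one label class can be sketched as follows: draw the training and test indices together without replacement, then split them, which guarantees the two sets are disjoint. The function name and the fixed seed are hypothetical.

```python
import random


def split_indices(class_size, n_train, n_test, rng):
    """Draw disjoint training and test index sets uniformly without replacement."""
    chosen = rng.sample(range(class_size), n_train + n_test)
    return chosen[:n_train], chosen[n_train:]


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
train_zeros, test_zeros = split_indices(5923, 1500, 2500, rng)
train_ones, test_ones = split_indices(6742, 1500, 2500, rng)
```

Repeating this with fresh randomness for each of the 10 trials and each value of $k$ reproduces the structure of the experiment.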

Given a training sample, we performed feature selection as described in Section 4 and trained a support
vector machine as in (4) with a loose $\ell_1$ restriction ($\tau=1000$). We then calculated the empirical risk using
the test set and plotted the results in the left portion of Figure 2. This figure shows the sparsity level $k$ versus
the mean empirical risk with error bars denoting one standard deviation from the mean. The large error bars
for $k=50$ are due to a single extreme outlier. Observe that empirical risk decreases and plateaus at around
$k=150$, and while we would expect the curve to trend upward for larger $k$ due to overfitting, we terminated our
computations at $k=180$ due to computation time.

For each trial of our experiment, we also recorded the misclassification rate in the training set. The difference
between the test and training misclassification rates forms a proxy for $R(\hat{f},f)-R_x(\hat{f},f)$. Qualitatively, the plot
of these differences matches the behavior predicted by the square-root term in Theorem 2, but the values in the
plot are orders of magnitude smaller. This suggests that the guarantee provided by the theorem is conservative
in our setting.

Figure 1. Sample of MNIST database of handwritten digits [14]. (top row) Each image is a $28\times28$ matrix with entries in
$[0,1]$. (bottom row) In order to minimize the runtime of learning, we decided to convert each image into a $5\times5$ image
by convolving with the indicator function of a $5\times5$ block and sampling over a $5\times5$ grid. Finally, we thresholded the
entries to produce a vector in $\{\pm1\}^{25}$. Identifying the label zero with $-1$ and one with $1$, the classification task amounts
to learning a function $f\colon\{\pm1\}^{25}\to\{\pm1\}$, and so we apply the method described in Section 4.

Figure 2. (left) For each $k = 10:10:280$, we ran 10 trials of the following experiment: Draw a random training set of
1500 zeros and 1500 ones, train a classifier with $d=3$ and parameter $k$ according to Section 4, draw a random test set of
2500 zeros and 2500 ones (disjoint from the training set), and record the misclassification rate. Here, we plot the average
rate along with error bars indicating one standard deviation above and below. For $k=50$, the error bars are large due to
a single outlier. The misclassification rate plateaus after around $k=150$, and so to minimize computational complexity,
we selected $k=150$ for our final classifier. For larger values of $k$, we expect the misclassification rate to increase due to
overfitting, but long runtimes prevented us from performing such experiments. (right) For each trial of our experiment, we
also recorded the misclassification rate in the training set. The difference between the test and training misclassification
rates forms a proxy for $R(\hat{f},f)-R_x(\hat{f},f)$. Qualitatively, the plot of these differences matches the behavior predicted by
the square-root term in Theorem 2, but the values in the plot are orders of magnitude smaller. This suggests that the
guarantee provided by the theorem is conservative in our setting.

Figure 3. All 28 digits that were misclassified by our classifier. Can you guess which are which? We suspect the classifier
would perform better if it received the original images instead of downsampled versions, but this would be computationally
costly. Indeed, we believe new algorithms need to be developed before our techniques can compete with the state of the
art. Spoiler: The last 7 digits are ones, and the rest are zeros.

Since empirical risk fails to noticeably decrease after about k = 150, we selected this value for k and performed 
our variable selection and training process on a large training set. Specifically, we randomly chose a training 
set consisting of 4000 zeros and 4000 ones (approximately two thirds of the entire database). Our test set 
then consisted of 1900 zeros and 1900 ones. With these training and test sets, our choice of k achieved a 
misclassification rate of 0.74% after a total runtime of 160 seconds. Considering we greatly downsampled our data 
from the MNIST database and we only attempted to classify zeros and ones, there is no direct comparison to be 
made with existing results [14]. Still, it is worth mentioning that the best SVM classifier exhibits a misclassification
rate (on all digits 0 through 9) of 0.56%, suggesting that our results are reasonable. To make this point stronger,
Figure 3 displays all 28 misclassified digits from the test set. We contend that a human would likely misclassify
these digits as well. Can you pick out the zeros from the ones? 

6. CONCLUSION AND OPEN PROBLEMS 

This paper demonstrates the plausibility of learning Boolean functions with concentrated spectra, as well as its
relevance to modern classification in both theory and application. However, this paper offers more questions than
answers. For example, while we showed in Section 3 that the VC dimension $h$ of $\mathcal{C}_{n,k}$ satisfies

$$\max\{n,k\} \leq h \leq 2nk - O(k\log k),$$

we have yet to identify precisely how $h$ scales with $n$ and $k$. This leads to our first open problem:

Problem 8. Determine the VC dimension of $\mathcal{C}_{n,k}$.

In Section 4, we proposed an algorithm for empirical risk minimization over polynomials in $\mathcal{C}_{n,k}$ of degree at
most $d$. However, we still don't know if this algorithm produces an estimate $\hat{f}$ that satisfies a guarantee of the
form (3).

Problem 9. Find a performance guarantee for the algorithm proposed in Section 4.

Problem 7 also remains open, but before this can be solved, one must first devise a candidate algorithm. This
leads to the following intermediate problem:

Problem 10. Let $W$ denote the $2^n\times2^n$ Walsh-Hadamard transform matrix, and let $z$ be some (nearly) $k$-sparse
vector in $\mathbb{R}^{2^n}$. Given a sample $x$ of $\ell$ entries of $f=\operatorname{sign}(Wz)$, find $\hat{f}\in\mathcal{C}_{n,k}$ that well approximates $f$ (at least
empirically) in $\mathrm{poly}(n,k,\ell)$ time.

This is perhaps the most important open problem in this paper. Considering our implementation in Section 5,
the algorithm proposed in Section 4 exhibits certain computational bottlenecks due to the poor dependence on
$d$. As such, the methods of this paper might fail to compete with state-of-the-art classification until we find a
solution to Problem 10.

Finally, while Section 5 demonstrated the effectiveness of our model for handwritten digits, we have yet 
to determine the full scope of its applicability. This suggests the need for more numerical experiments, but 
there is also a theoretical result to seek. Indeed, our approach was motivated by a certain hypothesis, and the 
confirmation of this hypothesis remains an open problem: 

Problem 11. Prove that Boolean functions that are well approximated by learnable deep neural networks also 
enjoy a highly concentrated Fourier transform. 

Such a result would establish that empirical risk minimization over $\mathcal{C}_{n,k}$ amounts to a relaxation of the
corresponding optimization over deep neural networks, and so our model would consequently inherit the real- 
world utility of such networks. 


ACKNOWLEDGMENTS 


This work was supported by an AFOSR Young Investigator Research Program award, NSF Grant No. DMS- 
1321779, and AFOSR Grant No. F4FGA05076J002. The views expressed in this article are those of the authors 
and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the 
U.S. Government. 


REFERENCES 

[1] N. Alon, P. Frankl, V. Rödl, Geometrical realization of set systems and probabilistic communication
complexity, FOCS (1985) 277-280.

[2] D. Amelunxen, M. Lotz, M. B. McCoy, J. A. Tropp, Living on the edge: Phase transitions in convex
programs with random data, Inform. Inference 3 (2014) 224-294.

[3] J. Andén, S. Mallat, Deep scattering spectrum, IEEE Trans. Signal Proc. 62 (2014) 4114-4128.

[4] J. Bruna, A. Szlam, Y. LeCun, Signal recovery from pooling representations, Available online:
arXiv:1311.4025

[5] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, Y. LeCun, The loss surfaces of multilayer
networks, Available online: arXiv:1412.0233

[6] A. Efrati, Google beat Facebook for DeepMind, creates ethics board, The Information, Jan. 27, 2014,
Available online: https://www.theinformation.com/Google-beat-Facebook-For-DeepMind-Creates-Ethics-Board

[7] Y. Gordon, On Milman's inequality and random subspaces which escape through a mesh in $\mathbb{R}^n$, Geometric
aspects of functional analysis (1986/87) 84-106.

[8] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta,
A. Coates, A. Y. Ng, Deep Speech: Scaling up end-to-end speech recognition, Available online:
arXiv:1412.5567

[9] J. Håstad, Computational limitations of small-depth circuits, MIT Press, Cambridge, Mass., 1987.

[10] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on
ImageNet classification, Available online: arXiv:1502.01852

[11] D. Hernandez, Facebook's quest to build an artificial brain depends on this guy, Wired, Aug. 14, 2014,
Available online: http://www.wired.com/2014/08/deep-learning-yann-lecun/

[12] J. Jackson, A. Klivans, R. Servedio, Learnability beyond AC^0, STOC (2012) 776-784.

[13] L. Jacques, J. N. Laska, P. T. Boufounos, R. G. Baraniuk, Robust 1-bit compressive sensing via binary
stable embeddings of sparse vectors, IEEE Trans. Inform. Theory 59 (2013) 2082-2102.

[14] Y. LeCun, C. Cortes, C. J. C. Burges, The MNIST database of handwritten digits, Available online:
http://yann.lecun.com/exdb/mnist/

[15] X. Li, J. K. Bradley, S. Pawar, K. Ramchandran, Robustifying the sparse Walsh-Hadamard transform
without increasing the sample complexity of O(K log N), Available online:
https://www.eecs.berkeley.edu/~kannanr/assets/project_ffft/WHT_noisy.pdf

[16] N. Linial, Y. Mansour, N. Nisan, Constant depth circuits, Fourier transform, and learnability, J. ACM
40 (1993) 607-620.

[17] G. F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks,
NIPS (2014) 1-9.

[18] A. Shpilka, A. Tal, B. Lee Volk, On the structure of Boolean functions with small spectral norm, Available
online: arXiv:1304.0371

[19] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the gap to human-level performance in
face verification, CVPR (2014) 1701-1798.

[20] E. van den Berg, M. P. Friedlander, Theoretical and empirical results for recovery from multiple
measurements, IEEE Trans. Inform. Theory 56 (2010) 2516-2527.

[21] V. N. Vapnik, The nature of statistical learning theory, 2nd ed., Springer, 2000.

[22] J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-norm support vector machines, NIPS 16 (2004) 49-56.



